Gwendolyn Eadie, Assistant Professor
November 4, 2019
Department of Astronomy & Astrophysics / Department of Statistical Sciences
University of Toronto
Example: We obtain some data from a star and estimate its temperature.
An outcome \( \omega \) is a measurement of the temperature of a star. The event \( A \) that the temperature is higher than 7500 K but not higher than 10000 K is \( A = (7500, 10000] \).
The probability here is purely mathematical, based on three axioms (definitions):
\( \mathrm{Pr}(A) \geq 0 \) for every \( A \), where \( A \) is an event.
\( \mathrm{Pr}(\Omega) = 1 \).
If \( A_1, A_2, \ldots \) are disjoint (don't contain any common outcomes), then \[ \mathrm{Pr}( A_1 \; \mathrm{or} \; A_2 \; \mathrm{or} \; \ldots ) = \mathrm{Pr}(A_1) + \mathrm{Pr}(A_2) + \ldots \]
What is the probability of \( A_1 \) and \( A_2 \) if they are disjoint?
\[ \mathrm{Pr}(A_1 \; \mathrm{and} \; A_2 \; \mathrm{and} \; \ldots) = 0 \]
Two events \( A \) and \( B \) are independent if \( \mathrm{Pr}(A \; \mathrm{and} \; B) = \mathrm{Pr}(A)\mathrm{Pr}(B) \)
If \( A_1, A_2, \ldots , A_k \) are a partition of \( \Omega \), then for any event \( B \):
\[ \mathrm{Pr}(B) = \sum_i \mathrm{Pr}(B|A_i)\mathrm{Pr}(A_i) \]
Suppose \( A_1, A_2, \ldots , A_k \) are a partition of \( \Omega \), and \( \mathrm{Pr}(A_i) > 0 \) for each \( i \).
If \( \;\mathrm{Pr}(B) > 0 \), then for each \( \; i = 1, \ldots , k \):
\[ \mathrm{Pr}( A_i | B ) = \frac{ \mathrm{Pr}( B | A_i )\mathrm{Pr}( A_i )}{\sum_j \mathrm{Pr}( B | A_j )\mathrm{Pr}( A_j )} \]
\( \mathrm{Pr}(A_i) \) is the prior probability of \( A_i \), and \( \mathrm{Pr}(A_i|B) \) is the posterior probability of \( A_i \). Often you'll see Bayes' theorem written more simply, like this: \[ \mathrm{Pr}(A|B) = \frac{\mathrm{Pr}(B|A)\mathrm{Pr}(A)}{\mathrm{Pr}(B)} \]
In this galaxy, 60% of the stellar systems have an Earth-like planet, and all of these systems also have a Jupiter-like planet. However, only half of the systems without an Earth-like planet have a Jupiter-like planet.
What's the probability that a stellar system with a Jupiter-like planet also has an Earth-like planet?
We want \( \mathrm{Pr}(\text{Earth-like}|\text{Jupiter-like}) \). By Bayes' theorem,
\[ \mathrm{Pr}(\text{Earth-like}|\text{Jupiter-like}) = \frac{\mathrm{Pr}(\text{Jupiter-like}|\text{Earth-like})\mathrm{Pr}(\text{Earth-like})}{\mathrm{Pr}(\text{Jupiter-like})} \]
or, writing \( E \) for Earth-like and \( J \) for Jupiter-like,
\[ \mathrm{Pr}(E|J) = \frac{\mathrm{Pr}(J|E)\mathrm{Pr}(E)}{\mathrm{Pr}(J)} \]
We are told that \( \mathrm{Pr}(E)=0.6 \) and that \( \mathrm{Pr}(J|E)=1 \)
We are also told, somewhat indirectly, that \( \mathrm{Pr}(J)=0.8 \). This follows from \[ \mathrm{Pr}(J) = \mathrm{Pr}(J|E)\mathrm{Pr}(E) + \mathrm{Pr}(J|\text{not E})\mathrm{Pr}(\text{not E}) \] \[ \mathrm{Pr}(J) = (1)(0.6) + (0.5)(0.4) = 0.6 + 0.2 = 0.8 \]
Now you can solve Bayes' theorem to find \[ \mathrm{Pr}(E|J) = \frac{\mathrm{Pr}(J|E)\mathrm{Pr}(E)}{\mathrm{Pr}(J)} = \frac{(1)(0.6)}{0.8} = 0.75 \]
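The arithmetic can be checked directly. A minimal sketch in Python (the variable names are mine; the numbers come from the problem statement):

```python
# Given quantities from the problem statement
pr_E = 0.6             # Pr(Earth-like)
pr_J_given_E = 1.0     # Pr(Jupiter-like | Earth-like)
pr_J_given_notE = 0.5  # Pr(Jupiter-like | no Earth-like planet)

# Law of total probability: Pr(J) = Pr(J|E)Pr(E) + Pr(J|not E)Pr(not E)
pr_J = pr_J_given_E * pr_E + pr_J_given_notE * (1 - pr_E)

# Bayes' theorem: Pr(E|J) = Pr(J|E)Pr(E) / Pr(J)
pr_E_given_J = pr_J_given_E * pr_E / pr_J

print(pr_J)          # approximately 0.8
print(pr_E_given_J)  # approximately 0.75
```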
What other distributions have you heard of?
Relationships between univariate distributions
How do I know what to use when I don't know all the distributions?
Think about the sampling process, and understand what kind of random variable you are dealing with.
The cumulative distribution function (CDF) is \( F_X(x) = \mathrm{Pr}( X \leq x ) \)
In other words, it gives the probability of a random variable \( X \) being less than or equal to the value \( x \).
The probability mass function (PMF) for \( X \) is \( f_X(x) = \mathrm{Pr}( X = x ) \).
It exists only for \( X \) that takes countably many discrete values.
Example: \( X \) could be defined as the number of stars measured until we find 22 blue ones with a mass less than 1 solar mass. Note that we could conceivably count an infinite number of stars before we ever get 22 blue ones less than 1 solar mass (if the universe contains infinitely many stars and we could measure them all!).
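To make the PMF (and its running sum, the CDF) concrete, here is a sketch of the simplest version of this counting example: the number of stars observed until the *first* blue one, assuming each star is independently blue with probability \( p \). The value \( p = 0.3 \) is made up, and the 22-blue-stars case in the text is the negative binomial generalization of this geometric example:

```python
p = 0.3  # assumed probability that any one star is "blue"

def pmf(k):
    """Pr(X = k): the first k - 1 stars are not blue, the k-th is."""
    return (1 - p) ** (k - 1) * p

def cdf(x):
    """F_X(x) = Pr(X <= x), the running sum of the PMF."""
    return sum(pmf(k) for k in range(1, x + 1))

# X takes countably infinitely many values, but the PMF still sums to 1:
print(cdf(100))  # very close to 1
print(cdf(3))    # 1 - (1 - p)^3, approximately 0.657
```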
The probability density function (PDF) is a function \( f_X(x) \geq 0 \) for all \( x \), which integrates to \( 1 \), and for every \( a \leq b \)
\[ \mathrm{Pr}(a < X < b) = \int_a^b f_X(x) dx \]
If such a function exists, then \( X \) is a continuous random variable with CDF
\[ F_X(x) = \int_{-\infty}^x f_X(t) dt \]
and \( f_X(x) \) is the derivative of \( F_X(x) \) at all points where \( F_X \) is differentiable.
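We can check this relationship numerically for a standard normal \( X \): integrate the PDF with a simple midpoint rule and compare to the closed form \( F(x) = \tfrac{1}{2}\left(1 + \mathrm{erf}(x/\sqrt{2})\right) \). The lower truncation at \( -10 \) and the grid size are arbitrary numerical choices:

```python
import math

def f(t):
    """Standard normal PDF."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def F(x, n=100_000):
    """Approximate F_X(x) = integral of f from -infinity to x
    by a midpoint Riemann sum, truncating the lower limit at -10
    (where the normal PDF is negligibly small)."""
    a = -10.0
    h = (x - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

exact = (1 + math.erf(1 / math.sqrt(2))) / 2
print(F(1.0), exact)  # both approximately 0.8413
```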
The terms joint distribution, conditional distribution, and marginal distribution get thrown around a lot.
What's the difference?
The joint distribution is the multidimensional probability distribution of two or more random variables.
If X and Y have a joint distribution, then we can write the PMF or PDF as \( f_{X,Y} \). Let's look at an example…
# load the MASS package, which provides mvrnorm
library(MASS)
# draw 5,000 samples from a bivariate normal distribution
# (Sigma must be symmetric, so both off-diagonal entries are 0.1)
samples = mvrnorm(n = 5000, mu = c(1, 1), Sigma = matrix(data = c(0.3, 0.1, 0.1, 0.5), nrow = 2, ncol = 2, byrow = TRUE))
# name the columns
colnames(samples) = c("theta1", "theta2")
# make samples a data frame
samples = as.data.frame(samples)
The conditional distribution is the distribution of a parameter given a specific value of the other parameter(s).
For example, in the previous plot, we could look at the conditional distribution of \( \theta_2 \) given that \( \theta_1=1.1 \). We denote this as \[ p(\theta_2 | (\theta_1=1.1) ). \]
NOTE: The conditional distribution will look different for different values of \( \theta_1 \). For example, the conditional distribution of \( \theta_2 \) given that \( \theta_1=0.2 \) looks like the green curve on the right: \[ p(\theta_2 | (\theta_1=0.2) ). \]
The marginal distribution is the distribution of a parameter regardless of the values of the other parameter(s).
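Both ideas are easy to see in the samples themselves. A Python sketch mirroring the R example (the symmetric covariance matrix, slice width, and sample size are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bivariate normal, as in the R example (off-diagonal covariance 0.1)
mean = [1.0, 1.0]
cov = [[0.3, 0.1], [0.1, 0.5]]
theta1, theta2 = rng.multivariate_normal(mean, cov, size=200_000).T

# Marginal distribution of theta2: use the theta2 draws regardless of theta1
marginal_mean = theta2.mean()

# Conditional distribution of theta2 given theta1 ~ 1.1: keep only draws
# where theta1 lands in a narrow slice around 1.1
in_slice = np.abs(theta1 - 1.1) < 0.02
conditional_mean = theta2[in_slice].mean()

# Positive covariance means conditioning on theta1 above its mean
# shifts theta2 upward: E[theta2 | theta1 = 1.1] = 1 + (0.1/0.3)(0.1)
print(marginal_mean)     # approximately 1.0
print(conditional_mean)  # approximately 1.033
```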
Suppose \( X \) and \( Y \) have joint PDF \( f_{X,Y} \).
Then \( X \) and \( Y \) are independent random variables if (and only if)
\[ f_{X,Y}(x,y) = f_X(x)f_Y(y) \] for all values of \( x \) and \( y \).
In other words, the independence we talked about earlier (\( \mathrm{Pr}(A,B)=\mathrm{Pr}(A)\mathrm{Pr}(B) \)) extends to probability density functions.
Expected value, or mean, or first moment: a one-number summary of a distribution, defined as \( E[X] = \int x \, dF(x) \).
If \( X_1, X_2, ..., X_n \) are random variables and \( a_1, a_2, ..., a_n \) are constants, then \[ E\left[ \sum_i a_i X_i \right] = \sum_i a_i E\left[ X_i \right] \]
If the random variables are independent, then \[ E\left[ \prod_i X_i \right] = \prod_i E\left[ X_i \right] \]
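Both rules are easy to verify by simulation; the particular distributions and constants below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Two independent random variables with known means
x = rng.exponential(scale=2.0, size=n)      # E[X] = 2
y = rng.normal(loc=3.0, scale=1.0, size=n)  # E[Y] = 3

# Linearity: E[aX + bY] = a E[X] + b E[Y] (no independence needed)
a, b = 5.0, -2.0
print((a * x + b * y).mean())  # approximately 5*2 + (-2)*3 = 4

# For independent variables: E[XY] = E[X] E[Y]
print((x * y).mean())          # approximately 2*3 = 6
```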
Variance summarises the spread of a distribution.
If \( X \) is a random variable with mean \( \mu \), the variance of \( X \) is \[ \mathrm{Var}(X) = E\left[ (X - \mu)^2 \right] = \int (x-\mu)^2 dF(x) \]
(we use \( dF(x) \) to symbolize that this could be for discrete or continuous random variables)
The variance is often denoted by \( \sigma_X^2 \) or \( \sigma^2 \). The standard deviation is \( \sqrt{\mathrm{Var}(X)} = \sigma_X \).
If \( X_1, X_2, \ldots, X_n \) are independent random variables from the same distribution, with mean \( \mu \) and variance \( \sigma^2 \), then the sample mean is \[ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i \]
and the sample variance is \[ S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2 \]
Now we're getting somewhere! \[ E[\bar{X}]=\mu \qquad \mathrm{Var}(\bar{X})=\frac{\sigma^{2}}{n} \qquad E[S^{2}]=\sigma^{2} \]
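A quick simulation confirms all three facts; the normal distribution and the values \( \mu = 10 \), \( \sigma^2 = 4 \), \( n = 25 \) are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 25, 100_000
mu, sigma2 = 10.0, 4.0

# Simulate many independent samples of size n; record Xbar and S^2 for each
samples = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=(reps, n))
xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)  # ddof=1 gives the n - 1 denominator

print(xbar.mean())  # approximately mu = 10          (E[Xbar] = mu)
print(xbar.var())   # approximately sigma2/n = 0.16  (Var(Xbar) = sigma^2/n)
print(s2.mean())    # approximately sigma2 = 4       (E[S^2] = sigma^2)
```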
If \( X \) and \( Y \) are random variables with means \( \mu_X \) and \( \mu_Y \), and standard deviations \( \sigma_X \) and \( \sigma_Y \), then the covariance between \( X \) and \( Y \) is
\[ \mathrm{Cov}(X,Y) = E\left[ (X-\mu_X)(Y-\mu_Y) \right] \]
The correlation between \( X \) and \( Y \) is
\[ \rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} \]
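A numerical sketch: the construction \( Y = X + \text{noise} \) below is my own, chosen so that the true covariance is \( 1 \) and the true correlation is \( 1/\sqrt{2} \approx 0.707 \):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Build correlated variables: Y = X + independent noise,
# so Cov(X, Y) = Var(X) = 1 and Var(Y) = 2
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# Covariance straight from the definition E[(X - mu_X)(Y - mu_Y)]
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()

# Correlation: Cov(X, Y) / (sigma_X sigma_Y)
rho = cov_xy / (x.std() * y.std())

print(cov_xy)                   # approximately 1
print(rho)                      # approximately 0.707
print(np.corrcoef(x, y)[0, 1])  # matches rho
```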
Statistical inference, or "learning" for those in computer science, is the process of using data to infer the distribution that generated the data.
Question: Given data \( x_1, x_2, ..., x_n \), obtained by drawing from some unknown distribution \( F \), how do we infer \( F \)?
An astronomer wanted to know the proportion of stars that are in a binary system. Let's pretend that in nature the true proportion of stars in a binary system is 80%. However, they only had a sample size of 70 objects in the sky.
The astronomer's data are binomial; either a star is in a binary system or it is not. They report a proportion of stars in binary systems of \( 0.771 \) with a 95% confidence interval of \( (0.673, 0.870) \).
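That interval is consistent with the standard normal-approximation (Wald) interval for a proportion. A check, assuming the observed count was 54 binaries out of 70 (so that \( \hat{p} = 54/70 \approx 0.771 \); the count itself is my inference from the rounded numbers):

```python
import math

k, n = 54, 70  # assumed: 54 of the 70 objects observed to be binaries
phat = k / n

# Wald 95% interval: phat +/- 1.96 * sqrt(phat (1 - phat) / n)
se = math.sqrt(phat * (1 - phat) / n)
lower, upper = phat - 1.96 * se, phat + 1.96 * se

print(round(phat, 3))                    # 0.771
print(round(lower, 3), round(upper, 3))  # 0.673 0.87
```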
What does this interval mean? Discuss with a neighbour.
Confidence Interval Applet http://www.rossmanchance.com/applets/ConfSim.html
Hypothesis testing assumes a default theory (null hypothesis) and devises a test to examine whether there is sufficient evidence in the data to reject the default theory.
Our choice of parametric model provides a framework in which to perform statistical inference. Previously, we considered the parameter \( \theta \) to be fixed but unknown.
Under the Bayesian inference paradigm, we instead take \( \theta \) to be random, and consider a distribution for \( \theta \) that represents our belief about its value before and after observing data.
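As a concrete sketch of this shift in viewpoint, here is a conjugate beta-binomial update for the binary-star proportion from the earlier example. The flat Beta(1, 1) prior and the count of 54 binaries in 70 objects are illustrative assumptions, not taken from the slides:

```python
# Conjugate update: a Beta(a, b) prior for theta plus k successes in
# n binomial trials gives a Beta(a + k, b + n - k) posterior.
a, b = 1.0, 1.0  # flat prior: every value of theta equally believable
k, n = 54, 70    # assumed data: 54 of 70 stars in binary systems

a_post, b_post = a + k, b + (n - k)

# Posterior mean E[theta | data] = a_post / (a_post + b_post),
# pulled slightly from the sample proportion 54/70 toward the prior mean 1/2
post_mean = a_post / (a_post + b_post)
print(post_mean)  # approximately 0.764
```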